30 research outputs found

Example file format of training dataset used in machine learning.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

Each line describes one protein and consists of the total binding affinity score for each peptide-MHC length combination, e.g. 304 combinations for 76 common MHC I alleles. (MHC I binds peptides that are typically eight to eleven amino acid residues long, so 76 alleles × 4 peptide lengths = 304 combinations.) A binding affinity score is an IEDB IC50 (nM) score below 5000. Each score is weighted by the length of the protein. The scores are the input variables (predictors). The last column is a 1 or 0 that indicates an expected ‘YES’ or ‘NO’ for vaccine candidacy and is the target variable. This expectation is based on the subcellular location annotation associated with the protein in UniProtKB (secreted or membrane-associated = 1, internal location = 0).
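The row layout described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `training_row`, the dictionary layout of the scores, and the two allele names are hypothetical, and "weighted by the length of the protein" is read here as division by the protein length, which the description does not confirm.

```python
from typing import Dict, List, Tuple

IC50_CUTOFF_NM = 5000.0  # only scores below this value count, per the description

Combo = Tuple[str, int]  # (allele name, peptide length)


def training_row(
    scores: Dict[Combo, List[float]],
    combos: List[Combo],
    protein_length: int,
    is_candidate: bool,
) -> List[float]:
    """Build one line of the training file: one feature per combo, then the target."""
    row = []
    for combo in combos:
        binders = [s for s in scores.get(combo, []) if s < IC50_CUTOFF_NM]
        # ASSUMPTION: length weighting is implemented as division by protein length.
        row.append(sum(binders) / protein_length)
    row.append(1.0 if is_candidate else 0.0)  # target: 1 = 'YES', 0 = 'NO'
    return row


# 76 alleles x 4 peptide lengths gives the 304 combinations mentioned in the text.
n_combinations = len([(a, k) for a in range(76) for k in (8, 9, 10, 11)])

# Two stand-in alleles keep the example short (hypothetical input data).
alleles = ["HLA-A*01:01", "HLA-A*02:01"]
combos = [(a, k) for a in alleles for k in (8, 9, 10, 11)]
example = training_row({("HLA-A*01:01", 9): [50.0, 450.0, 6000.0]}, combos, 100, True)
```

In this toy input the score of 6000 is ignored because it is not below the cutoff, so the feature for that combination is (50 + 450) / 100.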

FigShare

Comparison of test genes not identified by gene finders.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

++Number of groups of test genes not found in which the test genes are located consecutively along the chromosome.

The highest number of test genes in a consecutive group.
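A small Python sketch of how the two quantities above could be counted. It assumes each test gene has an integer rank giving its order along the chromosome and that a "group" means two or more adjacent missed genes; neither detail is stated in the caption, and the function name is hypothetical.

```python
def consecutive_groups(missed_ranks):
    """Group missed test genes into runs of consecutive chromosome ranks."""
    runs = []
    for rank in sorted(set(missed_ranks)):
        if runs and rank == runs[-1][-1] + 1:
            runs[-1].append(rank)  # extends the current run
        else:
            runs.append([rank])    # starts a new run
    # ASSUMPTION: a single isolated gene is not counted as a group.
    return [run for run in runs if len(run) > 1]


groups = consecutive_groups([3, 4, 5, 9, 12, 13])
n_groups = len(groups)                              # number of consecutive groups
largest = max((len(g) for g in groups), default=0)  # size of the largest group
```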

FigShare

Comparison of genomic start and end locations of gene predictions with 299 test genes.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

Abbreviations: gm = GeneMark_hmm, aug = AUGUSTUS, gl = GlimmerHMM.

**See Figure 5 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0050609#pone-0050609-g005) for an explanation of the classifications.

++Number of predicted genes that predict part of an entire gene, such that there can be more than one prediction for the same test gene.

Number of predictions that did not overlap the test genes in any way.
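As a rough illustration of comparing start and end locations, the Python sketch below labels a prediction against a test gene. The three labels used here are a simplification chosen for this example and are not the classification scheme of Figure 5; coordinates are assumed to be inclusive (start, end) pairs on the same strand and chromosome.

```python
def classify(prediction, gene):
    """Label a predicted gene relative to a test gene by genomic coordinates."""
    p_start, p_end = prediction
    g_start, g_end = gene
    if (p_start, p_end) == (g_start, g_end):
        return "exact"
    if p_start <= g_end and g_start <= p_end:  # closed intervals share a base
        return "overlap"
    return "none"


def count_non_overlapping(predictions, genes):
    """Count predictions that do not overlap any test gene in any way."""
    return sum(
        all(classify(p, g) == "none" for g in genes) for p in predictions
    )


# Hypothetical coordinates for illustration only.
test_genes = [(100, 500), (800, 1200)]
predictions = [(100, 500), (150, 300), (600, 700), (1100, 1500)]
labels = [classify(p, test_genes[0]) for p in predictions]
```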

FigShare

Schematic representation of gene prediction evaluation at the exon level.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

Exons are represented by shaded rectangles. Introns are represented by the adjoining solid lines. Abbreviations: TP = true positive, FP = false positive, and FN = false negative.
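One common way to count TP, FP, and FN at the exon level is sketched below in Python. It assumes that a predicted exon is a true positive only when both its start and end coordinates match an annotated exon exactly; the schematic itself may use a different matching rule, and the coordinates are made up.

```python
def exon_level_counts(predicted_exons, annotated_exons):
    """Return (TP, FP, FN) for exons given as (start, end) coordinate pairs."""
    predicted = set(predicted_exons)
    annotated = set(annotated_exons)
    tp = len(predicted & annotated)  # predicted and annotated
    fp = len(predicted - annotated)  # predicted but not annotated
    fn = len(annotated - predicted)  # annotated but not predicted
    return tp, fp, fn


annotated = [(1, 100), (200, 300), (400, 500)]
predicted = [(1, 100), (210, 300), (400, 500), (600, 650)]
tp, fp, fn = exon_level_counts(predicted, annotated)
```

Here the exon (210, 300) counts as both a false positive and a missed annotation, because its start differs from the annotated (200, 300).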

FigShare

Number of BLASTX hits using DNA consensus sequences from AUGUSTUS and GlimmerHMM predictions.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

The figure shows the BLASTX hits when using the consensus of predicted sequences from AUGUSTUS and GlimmerHMM as queries in an attempt to find novel Toxoplasma gondii proteins. These consensus sequences were derived from aligning predicted DNA sequences based on overlapping genomic locations (see text for details).

FigShare

Example of rule-based approach applied to highest affinity peptide on each test protein.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

Proteins are listed in ascending order of their lowest IC50 (nM) binding affinity score. A threshold value, e.g. 1.5, is applied to the score to split the list into two classes: below the threshold is ‘YES’ for vaccine candidacy and above is ‘NO’. The rule-based classification is compared with the expected classification to determine performance accuracy. The threshold value is derived by trial and error, with the aim of correctly classifying the greatest number of true positives and negatives.
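The threshold rule and the trial-and-error search can be sketched in Python as follows. The protein identifiers, scores, and candidate thresholds are invented for illustration, the function names are hypothetical, and a score exactly equal to the threshold is treated as ‘NO’, which the description leaves open.

```python
def classify_by_threshold(best_ic50, threshold):
    """Map protein id -> True ('YES') when its lowest IC50 is below the threshold."""
    return {pid: score < threshold for pid, score in best_ic50.items()}


def accuracy(predicted, expected):
    """Fraction of proteins whose rule-based class matches the expected class."""
    correct = sum(predicted[pid] == expected[pid] for pid in expected)
    return correct / len(expected)


def best_threshold(best_ic50, expected, candidates):
    """Trial and error: keep the candidate threshold with the highest accuracy."""
    return max(
        candidates,
        key=lambda t: accuracy(classify_by_threshold(best_ic50, t), expected),
    )


# Hypothetical data: lowest IC50 (nM) per protein and expected candidacy.
best_ic50 = {"P1": 0.8, "P2": 1.2, "P3": 2.0, "P4": 3.5}
expected = {"P1": True, "P2": True, "P3": False, "P4": True}
chosen = best_threshold(best_ic50, expected, [1.0, 1.5, 3.0])
chosen_accuracy = accuracy(classify_by_threshold(best_ic50, chosen), expected)
```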

FigShare

Plot of conservation scores computed for binding peptides along a protein (UniProtKB ID: P13664).

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

Each circle represents the amino acid conservation score computed at a sliding window. The window is of length 9 and slides one residue at a time. The colour of the circle represents binding affinities against 76 common MHC alleles computed at each window. A window (i.e. a peptide) can theoretically bind to all 76 alleles, and colours are therefore plotted in a set order: no, low, intermediate, and high affinity. For example, a dark blue circle for low affinity indicates that there are no intermediate or high affinity peptides at the window; however, a green circle for high affinity provides no indication of other affinities at the same window. Mean conservation = 0.7805; median conservation = 0.7946. For protein P13664 (Major surface antigen p30), 54.6% of high, 56% of intermediate, and 55.9% of low binders have conservation scores below the mean. The study shows that vaccine candidates are significantly more likely to have either a greater number of less conserved peptides or a lower total conservation score than non-vaccine candidates.
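A minimal Python sketch of a length-9 window sliding one residue at a time. It assumes the window score is the mean of per-residue conservation values, which the caption does not specify, and it uses made-up input values rather than data for P13664.

```python
from statistics import mean


def window_scores(per_residue, window=9):
    """One score per window position; the window advances one residue at a time."""
    # ASSUMPTION: a window's score is the mean of its per-residue scores.
    return [
        mean(per_residue[i:i + window])
        for i in range(len(per_residue) - window + 1)
    ]


def fraction_below_mean(scores):
    """Fraction of windows whose score is below the mean of all window scores."""
    overall = mean(scores)
    return sum(s < overall for s in scores) / len(scores)


# Hypothetical per-residue conservation values for a 10-residue sequence.
per_residue = [1.0] * 9 + [0.0]
scores = window_scores(per_residue)
below = fraction_below_mean(scores)
```

A sequence of n residues yields n - 9 + 1 windows, so the 10-residue toy input produces two scores.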

FigShare

Example of online output from IEDB peptide-MHC class I binding predictor.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

The binding predictor conceptually slides a window of a user-defined length (eight to eleven amino acid residues) one residue at a time from the start of the protein sequence. An affinity score is predicted for the ability of each fixed-length subsequence (as defined by each position of the sliding window) to bind to a user-specified MHC I allele. Fig. 1 shows the output when a sequence (e.g. MARHAIFFALCVLGL…) is input into the program to predict whether it contains peptides of length 9 that bind to the MHC allele HLA-A*11:01. The IC50 (nM) affinity scores for the subsequence ‘MARHAIFFA’ at positions 1 to 9 are highlighted.
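The window enumeration described above can be sketched in Python. This only lists the subsequences and their 1-based positions; it does not predict any IC50 value, which is the job of the IEDB tool. The function name is hypothetical, and only the visible prefix of the truncated example sequence is used.

```python
def sliding_peptides(sequence, length):
    """Yield (start, end, peptide) with 1-based inclusive positions."""
    if not 8 <= length <= 11:
        raise ValueError("MHC I peptide length is expected to be 8 to 11")
    for offset in range(len(sequence) - length + 1):
        yield offset + 1, offset + length, sequence[offset:offset + length]


# Only the 15 residues shown in the caption; the full sequence is elided there.
prefix = "MARHAIFFALCVLGL"
peptides = list(sliding_peptides(prefix, 9))
first = peptides[0]
```

The first window reproduces the subsequence highlighted in the caption, ‘MARHAIFFA’ at positions 1 to 9.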

FigShare

Sensitivity and specificity for random forest tests applied to peptide-MHC binding scores for vaccine classification of Benchmark dataset.

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

Abbreviations: (R) = target variable (e.g. 1 or 0) in training data randomly changed for each protein; HE = hold-out dataset error (%), i.e. error when predicting 30% of training data; OE = overall error (%), i.e. percentage of incorrect predictions; SN = sensitivity (%) = true positives/(true positives + false negatives); SP = specificity (%) = true negatives/(true negatives + false positives).

a: Cross-validation involved a random sample of 70% from the training dataset to build the predictive model, with the remaining 30% used for testing. This was repeated 10 times and the predictions averaged (predictions for the same input data fluctuate unless a random seed is set initially).

b: Benchmark proteins are from published studies with known or expected T-cell responses (source species: T. gondii); 100% of the training data was used to build the predictive model.

Note: Number of input variables used to build the predictive model = 304 (i.e. the number of allele-peptide length combinations derived from 76 common alleles).
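The error measures defined above, and a seeded 70/30 split, can be sketched in plain Python. No random forest is trained here; the labels are invented, the function names are hypothetical, and the split helper only illustrates why fixing a random seed makes repeated runs reproducible.

```python
import random


def confusion_metrics(expected, predicted):
    """Return SN, SP, and OE as percentages for 1/0 class labels."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == 1 and p == 1 for e, p in pairs)
    tn = sum(e == 0 and p == 0 for e, p in pairs)
    fp = sum(e == 0 and p == 1 for e, p in pairs)
    fn = sum(e == 1 and p == 0 for e, p in pairs)
    return {
        "SN": 100 * tp / (tp + fn),         # sensitivity
        "SP": 100 * tn / (tn + fp),         # specificity
        "OE": 100 * (fp + fn) / len(pairs)  # overall error
    }


def holdout_split(n_items, train_fraction=0.7, seed=None):
    """Shuffle item indices and split them into training and hold-out parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    cut = int(round(n_items * train_fraction))
    return indices[:cut], indices[cut:]


# Hypothetical expected and predicted classes for eight proteins.
expected = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = confusion_metrics(expected, predicted)
train_idx, test_idx = holdout_split(10, seed=1)
```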

FigShare

Number of matching predicted genes with 299 test genes using BLASTN (with 250, 500, and 1000 training genes).

Author: John T. Ellis (114797)
Paul J. Kennedy (114795)
Stephen J. Goodswen (114793)

Abbreviations: gl = GlimmerHMM; aug = AUGUSTUS. N/A = not applicable: the AUGUSTUS training program does not give the option to control the number of bases that precede and follow the coding segment (CDS) sequence of the training genes.

Number of predicted genes that align entirely or partly with the test genes and meet the criteria E-value = 0 and 100% coverage. A value in brackets is the number of predicted genes that are exactly the same as the test genes, i.e. each exon genomic coordinate is the same.

++Number of predicted genes that align to the same test gene, i.e. the predicted gene is only a part of the entire test gene and there can be one or more predictions per test gene.

The values underlined indicate the highest number of matches for each gene finder.

FigShare